ABSTRACT

The ISIS system transforms abstract type specifications into fault-tolerant distributed
implementations, while insulating users from the mechanisms whereby fault-tolerance is achieved.
This paper discusses the transformations that are used within ISIS, methods for achieving improved
performance by concurrently updating replicated data, and user-level issues that arise when ISIS is
employed to solve a fault-tolerant distributed problem. We describe a small set of communication
primitives upon which the system is based. These achieve high levels of concurrency while
respecting ordering requirements imposed by the caller. Finally, the performance of a prototype
is reported for a variety of system loads and configurations. In particular, we demonstrate that
performance of a replicated object in ISIS can equal or exceed that of a nonreplicated object.

Keywords: Fault tolerant distributed computing, replication, concurrency, atomic broadcast,
resilient objects, performance.


1. Introduction


Our basic premise is that the complexity of fault-tolerant distributed programs precludes
their design and development by typical programmers. This complexity seems to be inherent:
systems achieve fault-tolerance through redundant data or processing, and the distributed agreement
and synchronization protocols needed for this purpose are hard to implement. Moreover, high
levels of concurrency are required for reasons of performance, making it difficult to reason about
correctness in the presence of failures. Alternatives to direct implementation are needed if
fault-tolerant systems are to become widely available.

The ISIS project seeks to address this need by automating the transformation of
fault-intolerant program specifications into fault-tolerant implementations, which we call resilient
objects. In [Birman-a] we first reported this work; the present paper extends the previous
discussion, providing considerable additional detail and performance data. ISIS works by replicating
code and data while ensuring that the resulting distributed program gives behavior indistinguishable
from a single-site instantiation of the original specification. Although many systems have been
built to assist in the construction of distributed and fault-tolerant software, including ARGUS
[Liskov-b], EDEN [Lazowska], CLOUDS [Allchin], LOCUS [Popek], TABS [Spector], and the TANDEM
system [Bartlett], ISIS goes furthest in insulating users from the details of fault-tolerant
programming. Moreover, ISIS places few restrictions on programs. In contrast, other systems that
execute fault-intolerant specifications fault-tolerantly, such as CIRCUS [Cooper] and AURAGEN
[Borg], are restricted to programs that are fully deterministic given the specification. In
particular, concurrent calls to the program are only allowed if the calling sequence is fixed. This
sort of restriction effectively disallows the design of a fault-tolerant service that will
concurrently be used by more than one caller, which is the primary use to which resilient objects
will be put.

Rather than implementing a specialized distributed protocol for each algorithm needed in the
system, ISIS has been built using a communication layer supporting a variety of broadcast†
protocols. While broadcast primitives have been known in the literature for some time [Schneider]
[Chang] [Cristian], the primitives we describe are integrated with a failure detection mechanism
and provide unusually flexible delivery ordering properties. Within ISIS, algorithms are specified
as sequences of calls to these protocols, making it feasible to reason about the correctness of our
code. ISIS would have been far more complex, and our code more error prone, had it been built
directly from a lower-level communication mechanism such as asynchronous message passing.

It is sometimes argued that fault-tolerance is best approached by employing specially
designed redundant hardware. We believe that software fault-tolerance would be an issue even if
this were done. Crashes often stem from user error and obscure bugs in the operating system and
associated software, or in high level application software. Moreover, even special
hardware depends on a stable power source and air conditioning, and may have to be shut down
from time to time. Thus, failures seem inevitable in any distributed system, and it is important to
minimize the resulting disruption. Moreover, we will show that the problem of detecting and
reacting to failure is, in its essence, not very different from that of tolerating more benign events,
such as online reconfiguration and synchronization in parallel algorithms. By developing a
technology for software fault-tolerance, we also gain mechanisms for addressing these other problems.

The structure of this paper is as follows. The next section introduces resilient objects,
reviews the object specification language, and shows how an application program can be
constructed using ISIS. The example presented, a distributed calendar service, required just a few
days to design, code, and debug. At a more technical level, we then show how ISIS compiles a
specification into a fault-tolerant implementation using the communication primitives mentioned
above. The paper concludes by discussing the performance of our communication primitives,
some sample objects, and of the overall system as the load and configuration are varied.


†Here, the term “broadcast” refers to a software protocol for sending information from a single source to a set of
destination processes. Such broadcasts might take advantage of an ethernet broadcast capability, but can be
implemented using other interconnection devices as well.
2. Resilient objects


ISIS extends a conventional operating system by introducing a new programming abstraction,
the resilient object. Each resilient object is an abstract type that provides some service at a set of
sites, where it is represented by components to which requests can be issued using remote
procedure calls (RPC) [Birrel]. A typical ISIS application is constructed by developing conventional
front-end programs and interfacing them to one or more such objects. The programmer can also
define new, specialized, resilient objects if suitable ones do not already exist. A resilient object
can be used for several purposes: as a specialized database for fault-tolerant information storage,
as a source of status information through which processes monitor actions underway at remote
sites and detect failures or recoveries, and as an intermediary for controlling and synchronizing
distributed computations.

The translation of a non-distributed specification into a resilient object is based on several
assumptions about the environment in which ISIS will be used and what resiliency should mean in
this context. These are addressed below.

2.1. Failure assumptions

ISIS runs on clusters of computer systems communicating over a local area network. The
network should not be subject to partitioning. Since local networks are often built from ethernets
and token rings, consisting of interconnected clusters of sites, this assumption is reasonable. In
case partitioning does occur, ISIS has been designed to pause when fewer than half the sites in a
cluster are known to be operational, thus avoiding incorrect actions. Issues relating to reliable
operation in the presence of partitioning failures are being addressed by some of our colleagues
[El Abbadi-a] [El Abbadi-b]. We assume that the only way sites fail is by halting (crashing)
[Schlicting]; tolerance of more malicious failures would lead to rapid increases in protocol costs at
many levels of the system [Strong]. Failure detection and a collection of fault-tolerant broadcast
protocols are implemented in software on top of the bare network.




2.2. Properties of resilient objects

Throughout this paper, the term resiliency is used to denote k-resiliency. A k-resilient object
satisfies the following properties:

1. Consistency. The external behavior of an object is like that of a non-distributed one which
executes requests sequentially and to completion, with no interleaving of executions. An object may
also implement distributed synchronization operations that introduce additional consistency
properties.

2. Availability. Let f denote the number of components of an object that fail simultaneously. If
f ≤ k, then operational components continue to accept and process requests.

3. Progress. If f ≤ k, then operations are executed to completion, despite failures.

4. Recovery. Because ISIS supports replication, two cases can be distinguished:

a. Partial. If f ≤ k, a failed component restarts automatically when its site recovers.

b. Total. If f > k, failed components restart automatically when all the failed sites recover.

It should be stressed that k-resiliency is a much stronger property than resiliency of the sort
discussed in [Svobodova] or [Liskov-b]. In both of these papers, resiliency denotes the ability of a
system to remain in an internally consistent state during normal execution and to automatically
recover into such a state after a failure has been resolved. Our approach reflects the additional
assumption that data will be replicated, making it feasible (and desirable) to continue to provide
services even though a failure has occurred.
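
As an informal illustration, the properties above can be read as a function of the number of simultaneous failures f and the resiliency bound k. The sketch below is not part of ISIS; the function name `guarantees` and the dictionary keys are hypothetical and serve only to restate the four properties.

```python
def guarantees(f: int, k: int) -> dict:
    """Restate the k-resiliency properties for f simultaneous component failures.

    Hypothetical helper for illustration only; it encodes the conditions
    f <= k (availability, progress, partial recovery) and f > k (total recovery).
    """
    if f < 0 or k < 0:
        raise ValueError("f and k must be non-negative")
    within_bound = f <= k
    if f == 0:
        recovery = None            # nothing has failed, so nothing recovers
    elif within_bound:
        recovery = "partial"       # a failed component restarts when its site recovers
    else:
        recovery = "total"         # restart waits until all failed sites recover
    return {
        "available": within_bound,  # operational components keep accepting requests
        "progress": within_bound,   # operations run to completion despite failures
        "recovery": recovery,
    }
```

For example, with k = 2 a single failure leaves the object available and recoverable in the partial mode, whereas three simultaneous failures fall into the total recovery case.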

2.3. Logical execution model

We decided to model the execution of operations on resilient objects by transactions
[Eswaran]. Although this precludes supporting some interesting “non-transactional” resilient
types, it is convenient because it permits a programmer to specify resilient objects in a
straightforward manner, using a lock-based concurrency control algorithm to enforce the transactional
abstraction. A non-transactional model would force the programmer to become much more
involved in the details of synchronization.


Specifically, the execution of each operation gives rise to a transaction, which begins
(implicitly) when the procedure specifying the operation is invoked and commits (implicitly) when
that procedure returns a result. A procedure can also abort, which explicitly causes its actions to be
rolled back. Since procedures can call one another, our model is essentially that of Moss’ nested
transactions [Moss], although a wider variety of lock types are available than Moss discussed (Sec.
5.1).

Liskov has observed [Liskov-b] that a mechanism is needed for initiating top-level
transactions from within other transactions, in order to avoid severe inefficiency. A top-level transaction
is one that was invoked by a non-transactional caller; any transaction invoked within some other
transaction is said to be a subtransaction, and commits or aborts relative to its parent transaction.
Liskov points to a case in which the garbage collected during the execution of a subtransaction
must be reinstalled if its caller aborts, and argues that this inefficiency can be avoided if a
mechanism for initiating a top-level transaction from within other transactions is available. Top-level
transactions have other uses as well, notably because they permit increased concurrency when
transactionally updating a concurrently accessed data structure. Here, an update may affect both the
linkage fields of the data structure itself and the contents of some record, and it is desirable to
view these as independent events. By using top-level transactions to update link-fields, other
concurrently executing operations are permitted to “pass through” a node while it is still being
updated, without waiting until the update transaction has terminated. Operations that try to
access the record contents, however, block until the update transaction commits or aborts. For
these reasons, we included a top-level transaction mechanism in ISIS.

Another possibility was to support an undo mechanism, which would permit the programmer
to associate an arbitrary undo action with each operation. If it becomes necessary to abort an
operation, the corresponding undo action is executed. Unlike the system-level abort feature now
supported in ISIS, an undo mechanism could be used to back out of actions that have external
side-effects. However, abort is rare in our prototype, and we decided not to implement undo
actions at the present time.

3. Object specification language and system interface

In this section we review the language used to specify ISIS objects, and the interface provided to
external callers. A command language is also available, and is used to control the ISIS system
itself during execution, but is not described in detail here.

3.1. Resilient objects

The k-resilient types are a special class of abstract data types [Liskov-a]. Each resilient object
instantiates a resilient type and is accessible to holders of a capability on it; these are open in that
they can be freely copied or stored [Dietrich]. Resilient type specifications have the following
parts:

1. Declarations for the resilient data encapsulated by the type, consisting of one or more
indefinite-length arrays or heaps‡ of resilient records.

2. Type definitions for the parameters and results of operations.

3. Procedures for manipulating resilient data. These can be given attributes such as create
(executed when an instance of an object is created), entry (accessible to external callers), and
read-only (does not update resilient data).

Resilient procedures are coded using a version of the C programming language. All of C is
available, as are many operating system calls. The language has been augmented to include a
multi-tasking facility for internal concurrency and to provide several new statement types:


‡Sequentially allocated data structures tend to have “hot spots” which are frequently accessed, reducing potential
concurrency [Gawlick]. The heap management facility supports dynamic allocation and deallocation of resilient records
within transactions, avoiding a common source of hot-spots. Heap management is done using per-transaction allocation
and free lists, which are updated when a heap allocation or free is done and when a subtransaction commits or aborts.


1. I/O statements. Resilient data is accessed using read and write statements, which can also
specify a lock to acquire before performing the access. By forcing the programmer to use a
special notation when accessing resilient data, the most natural programming style tends to
minimize those operations, and this is also the most efficient way to use ISIS.

2. Remote procedure calls. A flexible RPC mechanism is provided, including nested, recursive,
and asynchronous RPCs, as well as RPCs in which the function to call is a parameter. It
should be noted, however, that there are some technical restrictions on the use of recursive and
asynchronous RPCs that stem from the model, hence it is not clear whether typical users will
make use of either feature. RPC is also used as an interface to most ISIS system functions.

3. Abort return. A normal return from a resilient procedure is interpreted as a commit of the
subtransaction that was being executed. In an abort return, the effects of the procedure (and any
that it has invoked) are erased.

4. Cobegin. A set of branches (statements, which do not contain return or abort statements) are
specified for concurrent execution. The cobegin terminates when all its branches terminate.
Each cobegin branch executes as a task within a single address space. A more usable cobegin,
which might provide some form of explicit control over concurrent processing at multiple sites,
is under consideration. The statement is currently used to keep a computation active while
some branch is blocked (e.g. when acquiring a lock). ISIS assumes that the branches of a
cobegin do not compete for locks on the same data items.

5. Toplevel. The statement is executed as a top-level transaction.

3.2. Features for monitoring remote activities

Although resilient objects provide a simple mechanism for concealing replication and
distribution, they will also be used to explicitly synchronize and control distributed computations. As a
result, features have been added to the language that permit a computation to pause until an
update to a replicated variable has been received, or until one of the sites at which an object
resides fails or recovers. A predefined SITES variable can be read to determine the full set of
sites at which an object resides. A VIEW variable gives the subset of SITES that are currently
operational. By combining these mechanisms, it is easy to build objects with very sophisticated
distributed functionality.

3.3. Programs external to ISIS

Non-resilient clients interact with resilient objects through an interface that resembles the
one used by objects to communicate with one another, by issuing RPC calls to objects on which
they hold or can obtain a capability. ISIS supports a globally accessible name space object, which
has a well-known capability, and can be used to look up a desired object using a symbolic name.

Normally, each RPC issued by a client executes as a top-level transaction. A client can,
however, explicitly combine a series of requests into a transaction. To do this, the program first
invokes a BEGIN procedure, then performs the operations, and then invokes COMMIT or ABORT.
A client can only have one active transaction at a time.

If a client fails while a transaction is in progress, the default action is to terminate the
transaction. Observe that the semantics of termination in this case differ from those for abort. This is
because an abort is explicitly executed by the computation to be aborted, whereas when a client
fails it may be necessary to interrupt a computation that is still in progress, or blocked (perhaps
deadlocked). To this end, ISIS supports a software kill signal, which can be issued by a client
process, and is automatically issued when a client fails before terminating a transaction. Kill cannot
be caught or ignored, and terminates a transaction by halting it and its subtransactions and then
aborting them.

Transactional systems that lack a mechanism for ensuring continued progress despite failure
generally implement a variant of kill to terminate transactions that are interrupted when a site at
which they executed fails (this gives rise to the orphan termination problem discussed in [Moss]
[Liskov-b]). Software built using these systems must avoid irreversible actions, like movement of
a mechanical arm or dispensing cash from a machine, because the only time such an action can
safely be taken is during the top-level commit (when kill can no longer occur). Unfortunately,
this tactic makes it hard to implement certain types of operation, for example one that moves a
mechanical arm while monitoring and reacting to sensor feedback. Here the desired behavior
could only be obtained by breaking the operation into a series of separate transactions. The
programmer would then implement his own algorithms to cope with failures, a nontrivial undertaking.
Because it uses a progress mechanism, ISIS never invokes kill automatically unless a client-level
transaction fails. Thus, an operation such as this is easily implemented within a resilient object.
Clearly such an object must be deadlock-free, but this is a minor restriction (for example, locks
can simply be acquired in a fixed order).

4. Programming in ISIS

This section summarizes the development of a distributed appointment and calendar system,
which was undertaken to exercise ISIS and to gain some programming experience using resilient
objects. The calendar combines data storage with a dynamic monitoring capability: users who
choose to display their schedule on a terminal are shown changes as they occur.

To develop the calendar program, we began by designing and implementing a conventional
single-user program with the same functional decomposition that we intended to employ in the
fault-tolerant distributed version. This program contains an interactive display module, a
command interpreter, and a collection of procedures for managing the calendar data structure. The
data structure consists of an array of per-user information (PUI) blocks and, for each user, a
linked-list of appointment schedules. The command interface permits a user to define a new
group (a set of users and a reason for their meetings) or schedule a meeting, and can provide
advice to a user who wishes to schedule a meeting with some group but is unsure what time to
pick. A graphic display of a user’s schedule is also provided.

Having completed this initial version of the calendar, we modified the calendar database into
a resilient object. The object supports operations to fetch or update the PUI for a user or group,
fetch or update the entry for a week, and to pause until a change to the schedule for the current
week is detected (this enables the program to refresh the display when the schedule is changed by
some other user).

Concurrency control proved to be straightforward to implement. We divided commands into
two types: read-only requests and update requests. Each is executed as a transaction. Locking is
done on just the PUI blocks, and these are locked in a fixed order, a strategy which is deadlock
free.
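
The fixed-order locking strategy can be sketched as follows. This is a single-process illustration using Python threads rather than ISIS resilient procedures; the class `PuiBlock`, the functions `lock_order` and `update_blocks`, and the choice of the user identifier as the ordering key are assumptions made for the example and do not come from the calendar program itself.

```python
import threading


class PuiBlock:
    """Hypothetical stand-in for a per-user information (PUI) block."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.version = 0
        self.lock = threading.Lock()


def lock_order(blocks):
    """Return the distinct blocks sorted by user id (the assumed fixed order)."""
    distinct = {b.user_id: b for b in blocks}
    return [distinct[uid] for uid in sorted(distinct)]


def update_blocks(blocks, apply_update):
    """Lock every block in the fixed order, apply the update, then unlock.

    Because every caller acquires locks in the same global order, no cycle of
    waiting transactions can form, so the scheme cannot deadlock.
    """
    ordered = lock_order(blocks)
    for block in ordered:
        block.lock.acquire()
    try:
        for block in ordered:
            apply_update(block)
    finally:
        for block in reversed(ordered):
            block.lock.release()
```

Two callers that name the same pair of users in opposite orders will still acquire the locks in the same sequence, which is the property that rules out deadlock.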

For reasons of performance, it was desirable to cache information in the interactive
front-end programs. A version number was therefore associated with each PUI block and incremented
by update transactions. All calendar information except the PUI is cached. Each time a PUI block
is referenced, any invalidated cache entries for that user are discarded. In practice it seems likely
that cache entries will normally be accurate simply because calendars are consulted more
frequently than they are updated.
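
A minimal sketch of this version-number caching scheme is given below. The class `CalendarCache` and its method names are hypothetical; the sketch assumes that the front end learns the current version number each time it reads a PUI block and that cached entries are keyed by user and week.

```python
class CalendarCache:
    """Front-end cache validated against PUI version numbers (illustrative only)."""

    def __init__(self):
        self._seen_version = {}  # user -> PUI version under which entries were cached
        self._entries = {}       # user -> {week: cached schedule entry}

    def reference_pui(self, user, pui_version):
        """Record a reference to the user's PUI block.

        If the version number has changed since the entries were cached, an
        update transaction has run, so the user's cached entries are discarded.
        """
        if self._seen_version.get(user) != pui_version:
            self._entries.pop(user, None)
            self._seen_version[user] = pui_version

    def get(self, user, week):
        """Return the cached entry, or None if it must be fetched from the object."""
        return self._entries.get(user, {}).get(week)

    def put(self, user, week, schedule):
        """Cache an entry fetched from the resilient object."""
        self._entries.setdefault(user, {})[week] = schedule
```

As long as the version number is unchanged, lookups are served locally; a single changed version number invalidates only the entries of the affected user.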

To summarize, we found it easy to implement a nontrivial distributed program using ISIS.
The result is not of production quality, but the remaining issues are in the calendar interface, not
the feasibility of implementing the calendar, and the program was never subject to concurrency
related bugs or problems with failure handling and recovery. This would certainly not have been
the case in a development starting with the basic UNIX interprocess communication primitives. In
fact, our program was converted by the author from a single-site version into a distributed one
within a few days. Moreover, the performance of the resulting system is reasonable; although
there may be a delay of several seconds before information from a complex update is reflected in
remote copies of the calendar, users of the system are unaware of this, since their local calendar
reflects updates rapidly. In effect, the program gives a speedy response and then completes the
request in the “background”.


5. Runtime issues

In this section we turn to the runtime mechanisms that underlie the implementation of a
resilient object. These are nontrivial because of the many “physical” events that can occur during
execution. Our treatment begins by surveying the algorithms without addressing details relating to
their implementation using broadcast primitives. Sec. 5.2 introduces the primitives actually
employed within ISIS, and Sec. 5.3 then shows how some of the algorithms of Sec. 5.1 can be
efficiently implemented using them.

5.1. Fault-tolerant execution of a request

We use the term task to refer to the physical execution of a request by one of the
components of a resilient object, designated the coordinator. Components are identical: any can be
coordinator for any request, which tends to distribute processing load. The components that are
passive for a request are designated as cohorts and serve as backups: one takes over if the
coordinator fails. Recall from Sec. 2.2 that a task must satisfy several properties to produce a correct
logical execution. We now consider these properties individually.

5.1.1. Consistency

Concurrency control [Bernstein] is not automatic in ISIS, because it is difficult to infer an
efficient concurrency control algorithm without knowledge of the semantics of operations.
Therefore, ISIS requires that the programmer provide a single-site concurrency control algorithm, which
is transformed into a distributed one. The class of concurrency control algorithms supported are
the conflict serialization algorithms [Papa], of which 2-phase locking is the best known and easiest
to code. Two lock classes are supported (see below); within each class, read, promotable
(exclusive) read, previous committed version read, and write locks can be requested.

The previous committed version read-lock is unusual, and deserves further explanation.
Locks of this sort permit a read-only transaction to execute concurrently with one that is updating
some of the data items it accesses. Denote such a read-only transaction R and an update
transaction with which it conflicts U. If R requests a read previous lock on some item x, that lock can be
granted even if U already has write-locked x, and the subsequent read request by R is satisfied from
the last committed version of x. Should U attempt to commit before R does so, U will now be
forced to wait until R commits and releases its locks. Thus, if R reads other records that U is
updating, it will consistently see the versions committed before U began executing. In effect, R
has been serialized before U although it started execution after U, and neither R nor U is forced
to block until U reaches its commit point. A similar form of concurrency control was described in
[Weihl], where hybrid atomicity was introduced to capture the behavior of a special “timestamped”
concurrency control method that also permits read-only transactions to run using previous versions
of data items. Since many transactions are read-only, previous copy locks could be very valuable
in systems such as ISIS.
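
The interaction between R and U can be illustrated with a small single-site model. The sketch below is not the ISIS implementation: the class `Item`, its methods, and the convention that `try_commit` returns False when the writer must wait are all assumptions introduced for the example, and real blocking is replaced by a return value.

```python
class Item:
    """One data item with a committed version and at most one pending write."""

    def __init__(self, value):
        self.committed = value     # last committed version
        self.pending = None        # (writer, value) while a write lock is held
        self.prev_readers = set()  # transactions holding read-previous locks

    def write(self, writer, value):
        """Write-lock the item (if needed) and record the uncommitted value."""
        if self.pending is not None and self.pending[0] != writer:
            raise RuntimeError("item is write-locked by another transaction")
        self.pending = (writer, value)

    def read_previous(self, reader):
        """Grant a read-previous lock even if the item is write-locked."""
        self.prev_readers.add(reader)
        return self.committed

    def release(self, reader):
        """Called when the read-only transaction commits and drops its lock."""
        self.prev_readers.discard(reader)

    def try_commit(self, writer):
        """Install the pending version; return False if the writer must wait."""
        if self.pending is None or self.pending[0] != writer:
            raise RuntimeError("no pending write by this transaction")
        if self.prev_readers:
            return False           # U waits until every R has released its lock
        self.committed = self.pending[1]
        self.pending = None
        return True
```

In this model R always observes the version committed before U began, and U's commit is deferred only while some read-previous lock is still held, which corresponds to serializing R before U.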

The two lock classes supported by ISIS are:

1. Nested 2-phase locks. When a subtransaction commits, the lock is retained by its parent
transaction [Moss]; when the top-level transaction commits, it is released.

2. Local 2-phase locks. When the transaction or subtransaction holding the lock commits, it is
released.
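
The difference between the two classes lies only in what happens to a lock at commit time, which the following sketch makes explicit. The `Transaction` class and the string labels "nested" and "local" are hypothetical; lock conflicts and waiting are deliberately omitted so that only the release rule is shown.

```python
class Transaction:
    """Tracks which locks a (sub)transaction holds, by lock class."""

    def __init__(self, parent=None):
        self.parent = parent        # None for a top-level transaction
        self.nested_locks = set()   # nested 2-phase locks
        self.local_locks = set()    # local 2-phase locks

    def acquire(self, item, lock_class):
        if lock_class == "nested":
            self.nested_locks.add(item)
        elif lock_class == "local":
            self.local_locks.add(item)
        else:
            raise ValueError("lock_class must be 'nested' or 'local'")

    def commit(self):
        """Commit and return the set of items whose locks are released now."""
        released = set(self.local_locks)      # local locks are always released
        self.local_locks = set()
        if self.parent is None:
            released |= self.nested_locks     # top level: nested locks released
        else:
            self.parent.nested_locks |= self.nested_locks  # inherited by parent
        self.nested_locks = set()
        return released
```

A subtransaction that commits therefore frees its local locks immediately, while its nested locks remain held on behalf of the parent until the top-level transaction commits.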

Nested 2-phase locks are easiest to work with, and probably suffice for most users. On the
other hand, by using local locks in conjunction with top-level transactions, it is possible to
implement highly concurrent data structures, and this approach was used successfully within the ISIS
namespace object. The combination of lock classes and types makes ISIS exceptionally flexible at
the level of concurrency control, and we believe that efficient concurrency control algorithms can
be devised for most objects.

5.1.2. Availability

Availability is satisfied by replicating the code and data for each resilient object. Since data
accesses are transactional, each item is represented as a stack of versions [Moss], replicated at
k+1 or more sites. A read-one copy, write-all (operational) copies rule is used when locking or
accessing replicated data. An item is updated by pushing a new version on all copies of the
corresponding stack, or replacing the top-most version if one already exists for the transaction
doing the update. Abort is implemented by popping the top version, and commit by popping the top
two and then pushing the first again. The best known alternative to the read-one, write-all
approach is to employ a quorum access rule, where both read and write requests are satisfied by
accessing multiple copies [Gifford] [Herlihy]. We rejected this because the latency incurred while
waiting for a quorum of responses from remote sites reduces the level of concurrency below that
which we attain using the read-one, write-all approach, where computation can proceed without
delay as soon as the local copy of a data item has been accessed (evidence to support this
conclusion is given in Sec. 7). Quorum methods are preferable if network partitioning is common,
because they increase availability [Herlihy] [El Abbadi-a] [El Abbadi-b]; however, this is not
felt to be an issue in the environment for which ISIS was designed.
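
The version-stack rules for update, abort and commit can be illustrated with a short sketch. It
is written in Python purely for exposition and is not ISIS code: the class name VersionStack, its
method names, and the representation of a version as a (tid, value) pair are assumptions of the
example. Only one copy and one level of transaction are modelled; in ISIS the same operations
would be applied at every operational copy.

```python
class VersionStack:
    """One copy of a data item, kept as a stack of (tid, value) versions."""

    def __init__(self, initial_value):
        # The bottom entry is the committed value; None means "no owning transaction".
        self.versions = [(None, initial_value)]

    def read(self):
        return self.versions[-1][1]

    def write(self, tid, value):
        # Replace the top version if this transaction already wrote one,
        # otherwise push a new version.
        if self.versions[-1][0] == tid:
            self.versions[-1] = (tid, value)
        else:
            self.versions.append((tid, value))

    def abort(self, tid):
        # Abort pops the top version, exposing the previous one.
        if self.versions[-1][0] == tid:
            self.versions.pop()

    def commit(self, tid):
        # Commit pops the top two versions and pushes the first (newest) again.
        if self.versions[-1][0] != tid:
            return  # this transaction wrote nothing to the item
        _, value = self.versions.pop()
        self.versions.pop()
        self.versions.append((None, value))


# Illustrative use: T1 writes twice and aborts, T2 writes and commits.
item = VersionStack(0)
item.write("T1", 5)
item.write("T1", 6)
depth_after_writes = len(item.versions)
item.abort("T1")
value_after_abort = item.read()
item.write("T2", 9)
item.commit("T2")
```

The second write by T1 replaces its own version rather than growing the stack, which is why abort
needs to pop only once.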

5.1.3. Progress

ISIS ensures that operations progress to completion using a transactional checkpoint-restart
scheme [Birman-a] [Birman-b]. Related work on non-transactional checkpoint and restart appears
in [Toueg] [Chandy]. Each RPC is broadcast to the operational components of an object, and
constitutes an initial checkpoint. If a coordinator fails while executing the request, one of its
cohorts takes over as the new coordinator. It restores its copy of the object to the state at the
time of the checkpoint, discarding versions of data items that were written by the transaction
being restarted (this requires no communication with other components or objects). The actions of
the failed coordinator are then repeated in restart mode in order to reestablish the state that
existed at the time of the failure.

When an operation is reissued in restart mode, it clearly should not be re-executed - otherwise,
the system state could become inconsistent (e.g. if an increment were done twice). To avoid
this, the results returned by completed operations are replicated in addition to the updates they
perform on replicated data. Specifically, when a coordinator finishes executing a request it
broadcasts the result to its cohorts as well as to the caller. The cohorts retain copies of each
result under the TID of the transaction, constituting a final checkpoint. Because the same TID is
used during restart (see Sec. 5.3 for details), when the operation is reissued a copy of the
retained result can be located and returned (this is done by the process that would normally have
executed the request in the called object). If none is found, a restart-mode request is rejected.
The coordinator performing the restart deduces from this that normal execution should resume.
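
The handling of a request in restart mode can be sketched as follows. This is an illustrative
Python fragment under stated assumptions, not the ISIS implementation: the class Component, the
method names retain and handle, and the "result"/"rejected" reply tags are hypothetical, and the
broadcast of results to cohorts is reduced to a local dictionary update.

```python
class Component:
    """One component of a resilient object, holding retained results by TID."""

    def __init__(self):
        self.retained = {}  # TID -> result of a completed operation

    def retain(self, tid, result):
        # Stands in for receiving the coordinator's broadcast of a result.
        self.retained[tid] = result

    def handle(self, tid, operation, restart_mode):
        if tid in self.retained:
            # A completed operation is never re-executed; return its result.
            return ("result", self.retained[tid])
        if restart_mode:
            # Nothing retained: the restarting caller should resume normal execution.
            return ("rejected", None)
        result = operation()
        self.retain(tid, result)
        return ("result", result)


# Illustrative use: an increment reissued in restart mode is not applied twice.
counter = {"n": 0}

def increment():
    counter["n"] += 1
    return counter["n"]

component = Component()
first = component.handle("x.1", increment, restart_mode=False)
reissued = component.handle("x.1", increment, restart_mode=True)
unknown = component.handle("x.2", increment, restart_mode=True)
```

Because the lookup is keyed by TID, the scheme depends on the restarted computation regenerating
exactly the same TIDs as the failed one.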

The storage overhead associated with our method is low. Retained results can be discarded
when the parent of a subtransaction commits or aborts, retaining its own result, or when the
top-level commits. On the other hand, since top-level statements are re-executed during restart,
the results of the top-most action in such a statement must be retained. Also, if a resilient
object takes external actions like moving a robot arm, the robot arm must provide a function
equivalent to a retained result - for example, a way that the device driver can determine the
command it last executed and in this manner identify a duplicated command (a minor restriction
since most devices of this sort contain microprocessors and memory).

While restarting, it is not enough for the new coordinator to determine the results returned
by operations that were previously executed. The serialization order must also be the same as was
used before the failure - otherwise, the values read from resilient data items by the new
coordinator might differ from those read by the previous one, again leading to inconsistencies.
ISIS addresses this issue by replicating both read and write locks, so that after a failure the
new coordinator holds all the locks acquired by the previous coordinator before it failed. Because
replicating read-lock information is potentially inefficient, the approach is to piggyback this
data on other messages that could depend on their existence - RPC requests and updates issued
subsequent to the acquisition of the lock. If an RPC or update persists after the crash, the
corresponding locking information persists as well. This information is forwarded to the new
coordinator before it is informed of the failure, which therefore registers the locks prior to
initiating restart [Birman-b]. The reader may be troubled by the apparently complex
synchronization requirements of this algorithm: read-locks must be registered before restart
begins, and the consistency of the system state must be maintained after failure. We show in the
next section that these problems can both be resolved in an elegant manner within the
communication primitives themselves.


5.1.4. Recovery


If a partial failure occurs, a failed component can recover by discarding its old state and
copying the current state from some operational component. In effect, the components of an
object act as dynamic backups, eliminating the need for stable (disk) storage. Later, we will show
that the communication primitives can be used to serialize recovery with respect to other
operations. To tolerate total failure, an object must save checkpoints and committed versions of
the object data on stable storage. When the components that failed last have recovered [Skeen-a],
they can resume operation from their stable stores; other components use the partial recovery
method.

Some care is needed in deciding what information should be placed in stable storage, since
this approach will be costly if access to stable storage must occur frequently. For example,
consider a request to insert a data item into a complex data structure. The message containing the
request may be tiny, but massive changes to the data structure could result. If ISIS were to
blindly perform these on a stable representation of the structure, performance would be very
poor. On the other hand, if the request itself were logged, the object could restart from failure
by replaying the request log. By saving periodic backups of the object state and clearing the log,
the cost of replaying it can be kept small. The cost of updating such a log will be minor in
comparison with the cost of maintaining the entire object state in a stable form. ISIS therefore
provides the programmer with a tool for maintaining a log transactionally. It is possible to log
any RPC request. The capability and arguments are written to the log and later can be replayed by
some other transaction. If a transaction aborts, log entries it has made are deleted. A
consequence is that the performance of the stable storage mechanism is not a bottleneck in ISIS.
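
The trade-off between logging requests and maintaining a stable copy of the state can be made
concrete with a small sketch. The Python below is only an illustration under simplifying
assumptions: the class LoggedObject and its methods are hypothetical, stable storage is simulated
by in-memory fields, a request is reduced to a (tid, key, value) insertion, and backups are
assumed to be taken only when no transaction is in progress.

```python
class LoggedObject:
    """An object that recovers from a periodic backup plus a log of requests."""

    def __init__(self):
        self.state = {}    # volatile object state
        self.backup = {}   # last backup written to (simulated) stable storage
        self.log = []      # (tid, key, value) requests logged since the backup

    def insert(self, tid, key, value):
        # Log the small request rather than the large state change it causes.
        self.log.append((tid, key, value))
        self.state[key] = value

    def abort(self, tid):
        # Log entries made by an aborted transaction are deleted.
        self.log = [entry for entry in self.log if entry[0] != tid]
        self.state = self._replay()

    def take_backup(self):
        # Saving a backup lets the log be cleared, keeping replay cheap.
        self.backup = dict(self.state)
        self.log.clear()

    def recover(self):
        # Restart from failure: reload the backup and replay the request log.
        self.state = self._replay()

    def _replay(self):
        state = dict(self.backup)
        for _tid, key, value in self.log:
            state[key] = value
        return state


# Illustrative use: T3 aborts, the volatile state is lost, and recovery rebuilds it.
obj = LoggedObject()
obj.insert("T1", "a", 1)
obj.take_backup()
obj.insert("T2", "b", 2)
obj.insert("T3", "c", 3)
obj.abort("T3")
obj.state = {}   # simulate a total failure
obj.recover()
```

After recovery only the single logged request from T2 has to be replayed on top of the backup.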

5.2. The communication subsystem

The ISIS communication subsystem provides three types of broadcast protocol for transmitting
a message reliably from a sender process to some set of destinations. The protocols are
atomic in an “all or nothing” sense: if any component of an object receives a message then, unless
it fails, all operational components will receive it. Atomic broadcast has often been proposed as
a basic primitive from which higher level system services can be constructed, and several
protocols for realizing such broadcasts have been reported in the literature [Schneider] [Chang]
[Cristian]. Unfortunately, although the number of packets transmitted to deliver a message is low
in the published protocols, the latency before message delivery takes place is potentially high in
comparison to average intersite message transit times, primarily because they enforce a global
message delivery ordering in addition to the atomicity property given above: broadcasts are
received in the same order everywhere in the system. Such strong ordering is only needed rarely in
ISIS. To overcome this problem, our protocols achieve varying degrees of order, and have latency
that varies accordingly. Moreover, unlike the previously reported work, our protocols are
integrated with a mechanism for dealing with failure and recovery at the level of individual
processes. We now summarize our approach, but omit the detailed protocols and correctness proofs,
which can be found in [Birman-c]. Fig. 1 illustrates a scenario in which two clients interact with
a process group while its membership changes dynamically; the interactions are labeled with the
type of primitive that would probably be used.


Fig. 1: Two clients interact with a process group using the broadcast primitives.

5.2.1. Broadcast primitives


The GBCAST primitive

GBCAST (group broadcast) is the most constrained, and costly, of the three primitives. We
will say that the operational components of an object form a process group. GBCAST transmits
information about failures and recoveries to process group members. A recovering component
uses GBCAST to inform the operational ones that it has become available. Additionally, when a
component fails, the system arranges for a GBCAST to be issued to group members on its behalf,
informing them of its failure. Arguments to GBCAST are a message and a process group identifier
(a capability on the resilient object), which is automatically translated into a set of
destinations.

Our GBCAST protocol ensures that if any process receives a broadcast b before receiving a
GBCAST g, then all overlapping destinations will receive b before g. This is true regardless of the
type of broadcast b. Moreover, when a failure occurs, the corresponding GBCAST message is
delivered after any other broadcasts from the failed process. Each component can therefore
maintain a view listing the membership of the process group, updating it when a GBCAST is received.
Although views are not updated simultaneously (in realtime), all components observe the same
sequence of view changes. Moreover, all components will receive a given broadcast message in
the same view¹.

Intuitively, the view represents a logical state in which the message arrived simultaneously at
all available components. This may not be the same as the set of operational components,
because some may still be executing the recovery algorithm (Sec. 5.3.2). A component of a
resilient object can take advantage of this to pick a strategy for processing an incoming request,
or to react to failure or recovery without running any special protocol first. Although the other
components may not have received the message yet or observed the failure or recovery, since the
broadcast primitives are atomic they will eventually do so, and since the GBCAST ordering is the
same everywhere their actions will all be consistent. Notice that GBCAST provides an inexpensive
way to determine the last site that failed: process group members simply record each new view on
stable storage; a simplified version of the algorithm in [Skeen-a] can thus be executed when
recovering from failure.

¹A problem arises with this definition if a process p fails without receiving some message after
that message has already been delivered to some other process q: q's view would show p to be
operational; hence, q will assume that p received the message, although p is physically incapable
of doing so. However, the state of the system is now equivalent to one in which p did receive the
message, but failed before acting on it. In effect, there exists an interpretation of the actual
system state that is consistent with q's assumption.

The BCAST primitive

The GBCAST primitive is too costly to be used for general communication between the process
group members that make up a resilient object. This motivates the introduction of weaker
(less ordered) primitives which might be used in situations where a total order on broadcast
messages is not necessary. Our second primitive, BCAST, satisfies such a weaker constraint.
Specifically, it is often desired that if two broadcasts are received in some order at a common
destination site, they be received in that order at all other common destinations, even if this
order was not predetermined. For example, the ISIS heap facility maintains replicated allocation
and free lists for each transaction by transmitting each heap operation to all copies; since the
operations are done in the same order everywhere, the lists are mutually consistent. The primitive
BCAST(msg, label, dests), where msg is the message and label is a string of characters, provides
this behavior. Two BCASTs having the same label are delivered in the same order at all common
destinations. A BCAST having the label [illegible] is ordered with respect to all other BCASTs. On
the other hand, BCASTs with different labels can be delivered in arbitrary order. This relaxed
synchronization results in potentially better performance.
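
The ordering guarantee that BCAST gives for messages with a common label can be stated as a
checkable property over delivery histories. The Python sketch below tests that property; it is
not a broadcast protocol and is not taken from ISIS. The function name bcast_order_ok and the
representation of a history as a per-site list of (label, message id) pairs are assumptions of
the example, and the special label that orders a BCAST against all others is not modelled.

```python
from itertools import combinations


def bcast_order_ok(deliveries):
    """Check the same-label ordering property of BCAST.

    deliveries maps a site name to the list of (label, msg_id) pairs it
    delivered, in delivery order. Returns True if every two messages that
    share a label are delivered in the same relative order at every pair
    of sites that delivered both.
    """
    positions = {
        site: {msg: index for index, msg in enumerate(sequence)}
        for site, sequence in deliveries.items()
    }
    for (_, pos1), (_, pos2) in combinations(positions.items(), 2):
        common = [msg for msg in pos1 if msg in pos2]
        for a, b in combinations(common, 2):
            if a[0] != b[0]:
                continue  # different labels: any relative order is allowed
            if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b]):
                return False
    return True


# Two heap operations keep their relative order; the lock message may interleave freely.
consistent = {
    "s1": [("heap", "m1"), ("lock", "m3"), ("heap", "m2")],
    "s2": [("lock", "m3"), ("heap", "m1"), ("heap", "m2")],
}
# The same two heap operations arrive in opposite orders at the two sites.
inconsistent = {
    "s1": [("heap", "m1"), ("heap", "m2")],
    "s2": [("heap", "m2"), ("heap", "m1")],
}
```

In the first history the replicated heap lists stay mutually consistent even though the two sites
saw the lock message at different points.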

The  OBCAST  primitive 

Our third primitive, OBCAST (ordered broadcast), is weakest in the sense that it involves
less distributed synchronization than GBCAST or BCAST. OBCAST(msg, dests) atomically delivers
msg to each operational dest so that if one process sends multiple messages to the same
destination, they are delivered in the order they were sent. Delivery ordering is unconstrained if
two broadcasts originate in different processes or are issued concurrently within a single
process. More specifically, if there exists a chain of message transmissions and receptions or
local events by which knowledge could have been transferred from the point at which the first
broadcast was issued to the point at which the second one was issued, we consider the broadcasts
to be potentially causally related, and the delivery ordering will respect the order of
transmission. For causally independent broadcasts, the delivery ordering is not constrained.
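
One common way to make "potentially causally related" concrete is with vector timestamps. This
technique is not described in the text and is not claimed to be how ISIS implements OBCAST; it is
offered only as an illustration of the relation. In the Python sketch below, a timestamp is
assumed to be a dictionary from process name to event count, and the function names are
hypothetical.

```python
def happened_before(vc_a, vc_b):
    """True if timestamp vc_a causally precedes vc_b.

    vc_a precedes vc_b when every entry of vc_a is <= the matching entry of
    vc_b and at least one entry is strictly smaller. Missing entries count as 0.
    """
    processes = set(vc_a) | set(vc_b)
    no_entry_larger = all(vc_a.get(p, 0) <= vc_b.get(p, 0) for p in processes)
    some_entry_smaller = any(vc_a.get(p, 0) < vc_b.get(p, 0) for p in processes)
    return no_entry_larger and some_entry_smaller


def potentially_causally_related(vc_a, vc_b):
    """True if knowledge could have flowed from one broadcast to the other."""
    return happened_before(vc_a, vc_b) or happened_before(vc_b, vc_a)


# p issues a broadcast, q receives it and then issues its own: ordered.
first_by_p = {"p": 1}
later_by_q = {"p": 1, "q": 1}
# r issues a broadcast without having heard from p or q: independent of both.
by_r = {"r": 1}
```

Broadcasts whose timestamps are incomparable correspond to the causally independent case, where
the delivery ordering is not constrained.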

OBCAST is valuable in ISIS because resilient objects employ concurrency control algorithms
for distributed synchronization. A consequence is that if two computations communicate
concurrently with the same process, the messages are almost always independent ones that can be
processed in any order: otherwise, concurrency control would have caused one to pause until the
other was finished. On the other hand, order is clearly important within a causally-linked series
of broadcasts, and it is precisely this sort of order that OBCAST respects.

5.2.2. Other broadcast abstractions

A weaker broadcast primitive is reliable broadcast, which provides all-or-nothing delivery,
but no ordering properties. The formulation of OBCAST in [Birman-b] actually includes a
mechanism for performing broadcasts of this sort, hence no special primitive is needed for the
purpose. Additionally, there may be situations in which BCAST protocols that also satisfy an
OBCAST ordering property would be valuable. Although our BCAST primitive could be changed
to respect such a rule, when we considered the likely uses of the primitives it seemed that BCAST
was better left completely orthogonal to OBCAST. In situations needing hybrid ordering behavior,
the protocols of [Birman-b] could easily be modified to implement BCAST in terms of OBCAST,
and the resulting protocol would behave as desired.

5.2.3. Synchronous versus asynchronous broadcast abstractions

Many systems employ RPC internally, as a lowest level primitive for interaction between
processes (this type of RPC should not be confused with the high-level RPC primitive used to
communicate with and between resilient objects). It should be evident that all of our broadcast
primitives can be used to implement replicated remote procedure calls [Cooper]: the caller would
simply pause until replies have been received from all the participants (observation of a failure
constitutes a reply in this case). We term such a use of the primitives synchronous, to distinguish
it from an asynchronous broadcast in which no replies, or just one reply, suffices.

In ISIS, GBCAST and BCAST are normally invoked synchronously, to implement a remote
procedure call by one component of an object on all the members of its process group. However,
OBCAST, which is the most frequently used overall, is almost never invoked synchronously.
Asynchronous OBCASTs are the source of most concurrency in ISIS: although the delivery ordering
is assured, transmission can be delayed to take advantage of piggybacking or to schedule I/O
within the system as a whole. While the system cannot delay such a broadcast indefinitely, the
ability to defer it a little, without delaying some computation by doing so, permits load to be
smoothed. As observed above, although concurrency is introduced by the primitive, it respects
the delivery orderings on which a computation might depend, and is ordered with respect to
failures, so this concurrency does not complicate higher level algorithms. Moreover, the protocol
itself is extremely cheap.

A problem is introduced by our decision to allow asynchronous broadcasts: the atomic reception
property must now be extended to address causally related sequences of asynchronous messages.
If a failure were to leave a “gap” in such a sequence, such that some broadcasts were
delivered to all their destinations but others that precede them were not delivered anywhere,
inconsistency might result even if the destinations do not overlap. We therefore extend the
atomicity property as follows. If process t receives a message m from process s, and s
subsequently fails, then the state of t may depend on any message m' received by s before it sent
m. Therefore, unless t fails as well, m' must be delivered to its remaining destinations. The cost
of the protocols is not affected by this change.

A second problem arises when the user-level implications of this atomicity rule are
considered. In the event of a failure, any suffix of a sequence of asynchronous broadcasts could
now be lost and the system state would still be internally consistent. A coordinator that is about
to send a top-level reply or take some action that may leave an externally visible side-effect will
therefore need a way to pause until all such broadcasts have actually been delivered. For this
purpose, a flush primitive is provided within the object specification language. Notice that
occasional calls to flush do not eliminate the benefit of using OBCAST asynchronously. Unless the
system has built up a considerable backlog of undelivered broadcast messages, which should be rare,
flush will only pause while transmission of the last few broadcasts completes. Flush is
automatically invoked when a log entry is written.
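
The interplay between deferred transmission and flush can be sketched with a simple queue. The
Python below is a single-process illustration under stated assumptions, not the ISIS mechanism:
the class AsyncSender, the pump method standing in for background transmission, and the transport
callable are all hypothetical, and handing a message to the transport is treated as delivery.

```python
from collections import deque


class AsyncSender:
    """Queues asynchronous broadcasts and transmits them later, in order."""

    def __init__(self, transport):
        self.transport = transport   # callable that transmits one message
        self.pending = deque()       # broadcasts issued but not yet transmitted

    def obcast(self, msg):
        # Asynchronous use: queue the message and return to the caller at once.
        self.pending.append(msg)

    def pump(self, count=1):
        # Stands in for background transmission of up to `count` messages.
        for _ in range(min(count, len(self.pending))):
            self.transport(self.pending.popleft())

    def flush(self):
        # Pause until every queued broadcast has been transmitted.
        while self.pending:
            self.transport(self.pending.popleft())


# Illustrative use: flush before an externally visible action.
sent = []
sender = AsyncSender(sent.append)
sender.obcast("update-1")
sender.obcast("update-2")
sender.pump()                  # the system transmits one message in the background
backlog_before_flush = len(sender.pending)
sender.obcast("commit")
sender.flush()                 # only the last few broadcasts remain to be sent
```

Because the queue is drained in first-in, first-out order, deferring transmission never reorders
messages issued by the same sender.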

5.3. Fault-tolerant implementation of selected operations

In this section, implementations are described for some of the operations that occur most
frequently within ISIS, using the primitives given above. In the interest of brevity, only a small
subset of ISIS is presented.

5.3.1. Object invocation and request processing; commit and abort

To issue a request to an object, a task first generates the transaction id under which the
desired operation should be executed. If a non-resilient process is performing the RPC, a new
top-level transaction is created and a unique identifier is assigned as its TID. If a task with
TID x does a series of RPCs, TIDs for the resulting subtransactions are formed by extending x with
an index: x.1, x.2, etc. The branches of a cobegin are assigned TIDs in the same manner. Finally,
if a toplevel statement is executed, a TID is generated as for a subtransaction but the prefix is
flagged as a top-level event.
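
The TID naming scheme can be sketched in a few lines. This Python fragment is illustrative only:
the class Task and its method are hypothetical, TIDs are modelled as strings, and the use of
square brackets to flag a prefix as a top-level event is an encoding invented for the example,
since the text does not say how the flag is represented.

```python
class Task:
    """A task executing under a TID that hands out TIDs for the work it starts."""

    def __init__(self, tid):
        self.tid = tid
        self.next_index = 1

    def new_tid(self, toplevel=False):
        # Extend this task's TID with the next index: x.1, x.2, and so on.
        index = self.next_index
        self.next_index += 1
        # Assumed encoding: brackets mark the prefix as a top-level event.
        prefix = f"[{self.tid}]" if toplevel else self.tid
        return f"{prefix}.{index}"


# Illustrative use: two RPCs, one toplevel statement, and a nested subtransaction.
task = Task("x")
first_rpc = task.new_tid()
second_rpc = task.new_tid()
toplevel_tid = task.new_tid(toplevel=True)
nested = Task(first_rpc).new_tid()
```

Since the index counter restarts from the same value when a computation is re-executed, a
restarted task that issues the same RPCs in the same sequence regenerates the same TIDs.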

Having determined the TID, the caller asynchronously OBCASTs the RPC to the components
of the destination object. A capability management facility translates the capability into a
list of process addresses for transmission². The caller then waits for a single reply, which could
come from any component of the object, or even arrive in duplicate because of failures.
Duplicates are discarded.

²An inexpensive protocol to maintain a cache of group addressing information, updating it if it is
found to be out of date during message transmission, is given in [Birman-c].

Upon receiving the RPC, a component must determine if it is the coordinator. All components
of each object are statically ordered by site number into a ring. A component computes
its ranking as the distance along the ring from the site where the RPC originated; the coordinator
is defined to be the lowest ranked operational component. This tends to place the coordinator at
the same site as the originator, which is desirable because it minimizes the latency incurred
before a result can be returned. The new coordinator returns a retained result if one is found.
Otherwise, it executes the new request and asynchronously OBCASTs the result to the caller and its
cohorts. A cohort watches the coordinator for failure, which it detects by reception of a GBCAST
message, and recomputes the ranking if one occurs. Since all components have the same view
when an RPC is received, and all subsequently see the same sequence of failures and recoveries,
the computed rankings are mutually consistent. Note that all necessary synchronization is
provided by the communication primitives.
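
The ranking rule lends itself to a short worked example. The Python below is a sketch under
stated assumptions rather than ISIS code: sites are assumed to be numbered 0 to ring_size - 1,
the view is modelled as a set of operational site numbers, and the function names rank and
coordinator are hypothetical.

```python
def rank(site, origin, ring_size):
    """Distance along the ring from the originating site to this site."""
    return (site - origin) % ring_size


def coordinator(origin, view, ring_size):
    """The lowest-ranked operational site; view is the set of operational sites."""
    return min(view, key=lambda site: rank(site, origin, ring_size))


# An object with components at sites 0, 2 and 4 on a ring of five sites.
ring_size = 5
view = {0, 2, 4}
local_choice = coordinator(origin=2, view=view, ring_size=ring_size)
remote_choice = coordinator(origin=3, view=view, ring_size=ring_size)
# Site 2 fails; every cohort removes it from its view and recomputes.
after_failure = coordinator(origin=2, view=view - {2}, ring_size=ring_size)
```

Every component evaluates the same function on the same view, so no messages need to be exchanged
to agree on the coordinator.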

Now, consider task termination. For each task, a capability list (CLIST) is maintained,
containing the capabilities of objects whose components should be informed when the task commits
or aborts. The coordinator uses OBCAST to asynchronously send a commit or abort message to the
objects in the CLIST. A CLIST is initially empty; a capability is added when an RPC is issued to
an object. Additionally, when a reply is received from a committed subtransaction, the CLIST for
that subtransaction is piggybacked on the reply and merged with that of the caller (unless the
subtransaction executed in a toplevel statement, in which case its CLIST is discarded when it
commits).

On reception, a commit or abort message for transaction T is delayed if some subtransaction
of T is still active. This makes it possible for a subtransaction to reply to its caller before
issuing its own commit or abort, a tactic that reduces latency and ensures that at least one copy
of the reply will reach the caller (a duplicate might be sent if the coordinator fails after
sending the reply but before sending the commit). After all subtransactions have terminated,
retained results corresponding to T are deleted, and the local lock manager and version-stack
managers are informed of the event. When a kill is received, if the coordinator is doing a restart
it waits until restart is completed (to ensure that the CLIST is accurate). Since restart is done
without blocking, it will terminate. Kill is then forwarded to any active subtransactions, and an
abort is performed.

5.3.2. Recovery from partial failure

To initiate recovery, a component issues a GBCAST to the operational components of the
object to which it belongs. When this message is received, any component transfers its state to
the recovering one: since the states of the operational components are determined by the messages
they have received, and each has received the same set of messages, all are in the same (logical)
state. This GBCAST can thus be thought of as a synchronous RPC that returns the current state
of the object and has the side-effect of modifying the process-group view to include the recovered
component. The total ordering of GBCAST with respect to other broadcasts provides all the
necessary synchronization.

5.3.3. Managing replicated locks

The locking facilities discussed earlier are easily implemented using our broadcast primitives.
A read-lock is first obtained locally by the coordinator of a computation. Then, a read-lock
registration message is asynchronously OBCAST to the other copies of the data item. The coordinator
immediately continues execution, as if its read-lock were already replicated, although the message
may not actually have been delivered anywhere. If the coordinator fails and any process had
received a message m sent after the lock acquisition, the read-lock will be registered before the
failure can be “detected” by the cohorts managing other copies of the lock. This follows because
the read-lock registration precedes m and hence must be delivered despite the failure, whereas (by
definition) the GBCAST follows m and hence must be delivered after the lock registration.
Because the read-lock registration message is small and asynchronous, piggybacking such messages
on outgoing updates and RPC messages is particularly effective.

Unlike a read-lock, a write-lock must be granted explicitly by all components of an object,
except in certain special cases³ described in [Raeuchle]. Moreover, a write-lock request can be
performed correctly whether or not other broadcasts issued by the computation have been
delivered, hence the request is not subject to the OBCAST type of ordering constraint. Note,
though, that if two write-lock requests are issued concurrently (on the same item), they could
deadlock simply by being granted in different orders at different sites. This is just the type of
ordering problem addressed by BCAST. To acquire a write-lock, the request is synchronously
BCAST using the identifier of the data item as a BCAST label. If a component fails during the
protocol, the caller withdraws the partially acquired write-lock and then rerequests it. Since the
read-lock registration message preceded the GBCAST announcing the failure, either it is delivered
before the GBCAST, or no site received a message from a failed coordinator after it obtained the
lock. Because of the withdrawal rule, the write-lock is rerequested after the GBCAST message is
received, so it will be forced to wait if the coordinator held a read lock and that lock survived
the crash. Moreover, since BCAST is delivered in the same order everywhere, concurrent write-lock
requests will not deadlock.
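
Why a common delivery order rules out this kind of deadlock can be seen in a small model. The
Python sketch below is an illustration under simplifying assumptions, not the ISIS lock manager:
each site is assumed to grant the write-lock to the first request it delivers and to queue the
rest, and the function names are hypothetical.

```python
def lock_holders(delivery_order_by_site):
    """Map each site to the transaction it grants the write-lock to.

    delivery_order_by_site maps a site name to the list of requesting
    transactions in the order that site delivered their requests.
    """
    return {site: order[0] for site, order in delivery_order_by_site.items()}


def could_deadlock(delivery_order_by_site):
    """True if different sites granted the lock to different requesters.

    Each requester would then hold the lock at some sites while waiting
    for it at the others.
    """
    holders = set(lock_holders(delivery_order_by_site).values())
    return len(holders) > 1


# Unordered delivery: sites A and B see the two requests in opposite orders.
unordered = {"A": ["T1", "T2"], "B": ["T2", "T1"]}
# BCAST with the data item as label: the same order at every site.
ordered = {"A": ["T2", "T1"], "B": ["T2", "T1"]}
```

With a common order one requester is granted the lock everywhere and the other simply waits behind
it at every site.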

5.3.4. Updating replicated data

Read operations are satisfied from the version stack for the local copy of the data item being
accessed. Three implementations are supported for write operations.

Synchronous  update. 

For this method, GBCAST is used to synchronously transmit the new value to all operational
components. The method is only used for experimental evaluation of the effect of asynchronous
data transmission on performance, as reported in Sec. 7. Note, however, that if ISIS used a
quorum replication method, both read and write operations would effectively be synchronous.
Thus, the performance of synchronous update sheds light on the performance attainable with a
quorum replication method.

³The most important of these is that, since coordinators for a single transaction are run at the
same site, after a transaction has acquired a distributed write lock on an item x, its
subtransactions need only lock x locally.


Concurrent update.

Although synchronous update is conceptually simple, costly delays are incurred while waiting
for acknowledgements. Using concurrent update, data are updated locally by the coordinator,
which issues an asynchronous OBCAST to inform its cohorts [Joseph-b]. An asynchronous
OBCAST is also used to commit, at which time locks are released. Since the updates precede the
commit and OBCAST respects this ordering, any process that obtains a lock will observe the
correct version of the data it reads. Thus, the semantics of the synchronous update are preserved
but, if few write-locks are needed, the response time is limited by the local execution speed of the
request. Recall that when concurrent update is in use, it may be necessary to invoke flush before
returning a result from a top-level operation or taking an action with external side-effects.
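
The control flow of concurrent update can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the class and method names (Coordinator, Cohort, flush, deliver_pending) are hypothetical, and a FIFO queue per cohort stands in for the ordered delivery that the text attributes to the broadcast primitive. It is not the ISIS implementation.

```python
from collections import deque


class Cohort:
    """Replica that applies messages in the order they were queued."""

    def __init__(self):
        self.data = {}
        self.committed = set()
        self.inbox = deque()   # FIFO queue: stand-in for ordered delivery

    def deliver_pending(self):
        # Updates were queued before the commit, so they are applied first.
        while self.inbox:
            kind, txn, key, value = self.inbox.popleft()
            if kind == "update":
                self.data[key] = value
            elif kind == "commit":
                self.committed.add(txn)


class Coordinator:
    def __init__(self, cohorts):
        self.data = {}
        self.cohorts = cohorts

    def _async_broadcast(self, msg):
        # Enqueue the message and return without waiting for acknowledgements.
        for cohort in self.cohorts:
            cohort.inbox.append(msg)

    def write(self, txn, key, value):
        self.data[key] = value                       # update the local copy
        self._async_broadcast(("update", txn, key, value))

    def commit(self, txn):
        self._async_broadcast(("commit", txn, None, None))

    def flush(self):
        # Hypothetical flush: force delivery of everything still in transit.
        for cohort in self.cohorts:
            cohort.deliver_pending()
```

In this sketch, write and commit return as soon as the local work is done; a caller that is about to return a result from a top-level operation would call flush first, so that no cohort lags behind the state the caller has exposed.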

Delayed update.

The concurrent update scheme assumes a pessimistic write-locking algorithm, which waits for
responses from all operational components each time a write-lock is needed. Pessimistic locking
permits the programmer to design a deadlock-free object and hence to implement objects that take
irreversible actions. However, better performance can sometimes be obtained using an optimistic
locking algorithm together with delayed updating. Write-locks are acquired locally by the
coordinator, which queues update messages but does not transmit them. When the transaction is
prepared to commit, it attempts to acquire these write locks from its cohorts using the protocol of
Sec. 5.3.3. The transaction aborts (discarding its queued updates) if deadlock would result.
Otherwise, it transmits the updates using OBCAST.
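
A minimal sketch of the delayed update discipline follows. The names Site and DelayedUpdateTxn are hypothetical, and the sketch makes one simplifying assumption that is stricter than the text: it aborts on any lock conflict at commit time, whereas the scheme described above aborts only if deadlock would result. Message transmission is replaced by direct writes to each site's store.

```python
class Site:
    """One replica with its own store and write-lock table."""

    def __init__(self):
        self.store = {}
        self.locks = {}   # key -> name of the transaction holding the lock

    def try_lock(self, key, txn):
        # Grant the lock if it is free or already held by this transaction.
        return self.locks.setdefault(key, txn) == txn

    def release(self, txn):
        self.locks = {k: t for k, t in self.locks.items() if t != txn}


class DelayedUpdateTxn:
    def __init__(self, name, home, cohorts):
        self.name, self.home, self.cohorts = name, home, cohorts
        self.queued = []   # updates buffered until commit

    def write(self, key, value):
        if not self.home.try_lock(key, self.name):
            raise RuntimeError("local write-lock conflict")
        self.queued.append((key, value))   # queue the update, do not send it

    def commit(self):
        sites = [self.home] + self.cohorts
        keys = {k for k, _ in self.queued}
        # Single distributed concurrency-control step, at the end.
        ok = all(s.try_lock(k, self.name) for s in self.cohorts for k in keys)
        if ok:
            for site in sites:
                for key, value in self.queued:
                    site.store[key] = value
        # On abort the queued updates are simply discarded.
        self.queued = []
        for site in sites:
            site.release(self.name)
        return ok
```

Two transactions that write the same item at different sites both succeed locally; the conflict only surfaces when the first of them tries to commit, at which point it aborts and the other can proceed.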

Using delayed updates, the possibility of an occasional abort is accepted as an alternative to
issuing multiple write-lock requests - only one distributed concurrency control action is needed,
and it occurs at the end of the transaction. Moreover, other transactions can read old versions of
any data items being updated (but only at remote sites) and multiple updates could be sent in each
message. These benefits have a cost: large amounts of buffering may be needed to support the
technique, and irreversible actions are precluded. The ISIS prototype will be used to compare
delayed and concurrent update in the future. Both update methods have been proved correct for
objects that use conflict-serializability as a correctness constraint [Joseph-a]. An open problem is
to investigate the applicability of these techniques in systems which employ other correctness
constraints.

6. System architecture and implementation issues

6.1. Communication primitives

We now summarize the architecture and implementation of the ISIS communication subsystem.
The primitives are built in layers, starting with a “bare” network providing unreliable
datagrams. A site-to-site acknowledgement protocol converts this into a sequenced, error-free
message abstraction, using timeouts to detect apparent failures. An agreement protocol is then
used to convert the site failures and recoveries into an agreed upon ordering of events. If
timeouts cause a failure to be detected erroneously, the protocol forces the affected site to
undergo recovery.

Built on this is a layer that supports the various primitives. OBCAST has a very light-weight
implementation. Each process buffers copies of any messages needed to ensure the consistency of
its view of the system. If message m is delivered to process p, and some message m' precedes m,
a copy is sent to p also. Thus, if any chain of process-to-process interactions leads to the intended
recipient of a message, the message will travel down that chain and can be delivered (duplicate
copies are discarded). An inexpensive garbage collector tracks down and deletes superfluous
copies after a message has reached all its destinations. By using extensive piggybacking and a
simple scheduling algorithm to control message transmission, the cost of an OBCAST is kept low -
often, less than one packet per destination. BCAST employs a two-phase protocol based on one
suggested to us by Skeen [Skeen-b]. This protocol has higher cost than OBCAST because delivery
can only occur during the second phase; BCAST is thus inherently synchronous. Recall, however,
that ISIS uses BCAST primarily for write-lock acquisition, which can be done rarely. Moreover,
BCAST is only needed the first time a computation write-locks a particular variable; subsequent
attempts to re-lock it (not uncommon in nested transactions) can be handled locally. GBCAST is
implemented using a two-phase protocol similar to the one for BCAST, but with an additional
mechanism that flushes messages from a failed process before delivering the GBCAST announcing
the failure. Although GBCAST is slow, it is used very rarely. More details and correctness proofs
appear in [Birman-c].
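
The reason delivery must wait for the second phase can be seen in a sketch of a two-phase ordering protocol. The text says only that BCAST is based on a protocol suggested by Skeen; the details below (per-process counters, priorities of the form (counter, process id), adopting the maximum proposal) follow the commonly described form of that idea and are an assumption, not the ISIS code. The names Process, propose, finalize and bcast are hypothetical, and failures are not modeled.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.pending = {}     # msg_id -> [priority, deliverable flag]
        self.delivered = []   # msg_ids in delivery order

    def propose(self, msg_id):
        # Phase 1: buffer the message and propose a priority for it.
        self.clock += 1
        prio = (self.clock, self.pid)
        self.pending[msg_id] = [prio, False]
        return prio

    def finalize(self, msg_id, prio):
        # Phase 2: adopt the agreed priority, then deliver whatever is safe.
        self.clock = max(self.clock, prio[0])
        self.pending[msg_id] = [prio, True]
        while self.pending:
            head = min(self.pending, key=lambda m: self.pending[m][0])
            if not self.pending[head][1]:
                break   # an earlier message still awaits its final priority
            del self.pending[head]
            self.delivered.append(head)


def bcast(msg_id, group):
    """Run both phases back to back for one message."""
    final = max(p.propose(msg_id) for p in group)
    for p in group:
        p.finalize(msg_id, final)
```

A message is held back while any buffered message with a smaller priority is still undecided, which is why two broadcasts that reach the processes in different orders are nevertheless delivered in the same order everywhere.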


The ISIS prototype was built under UNIX 4.2 and is organized hierarchically, as illustrated
in Fig. 2. The lowest level provides the communication primitives described earlier, together with
a message “editing” subsystem supporting variable-format messages with symbolically named
message-fields. Built on top of this is a layer supporting concurrent tasks, monitors for mutual
exclusion, the transactional version stack, the lock manager, the capability manager [Dietrich],
which maps a capability on an object to a list of sites where its components reside, the namespace,
which maps symbolic names to capabilities, and the interface used by external non-resilient
processes to issue requests to resilient objects. The capability manager supports dynamic
migration of objects, although we do not yet exploit this possibility.

We originally feared that processes and inter-process communication would be the dominant
cost factor in ISIS. Consequently, a single system process handles functions common to all
resilient objects, and a single “type manager” is used for each resilient type. A type manager
multiplexes its time between the different instances of its type residing at the site where it is executing;
these in turn multiplex their time among currently active tasks. Process creation occurs only when
a new type manager must be started (this idea was suggested in [Lazowska]). Commands to
interactively load and unload type managers (e.g. when a new type is defined) are provided by the
system process.

In retrospect, we feel that the decision to multiplex type managers was an error. The
increased code complexity required to keep separate copies of the various data structures used in
the type manager for each instance was not justified by the reduced overhead that resulted. In
any future version of ISIS, each object will be represented by a single process at every site where
it resides and the runtime system will be fragmented into multiple processes: a process group
manager, a protocols process, a failure detector, communication buffering processes, etc. We now
believe that this would reduce complexity and that adverse performance impact can be minimized
with careful tuning. We also believe that if ISIS is to perform well, it must be moved away from
UNIX, and are planning to do so in the future.

7. Performance of the prototype

A prototype of ISIS has been operational since January 1985. Performance is reported for a
cluster of SUN 2/50 workstations interconnected by a 10-Mbit Ethernet (Table 1). Our approach
was to evaluate the performance of the communication primitives, the response time for some
simple resilient objects, and the overall response time of the system when presented with
concurrent requests at multiple sites. The indexed sequential file, built from a resilient directory
and a resilient file, illustrates the overhead associated with nesting.

When we instrumented ISIS, we discovered that the performance of our IPC connections was
suboptimal, primarily because the version of UNIX we used did not support changes to the IPC
buffer size, which was consequently too small to permit effective “windowing”. We lacked the
resources to correct this problem.

Table 1: Performance of the ISIS Prototype

The first set of figures addresses performance of the version store and lock manager. These
show that while the version store is very fast in its in-core partial recovery mode, it degrades in
the disk-based “stable” storage mode. This supports our decision to favor log-based recovery
from total failures, since the use of stable storage is minimized in this manner. Consequently, the
resilient objects tested were run in the in-core mode only.

The broadcast primitives are dominated by underlying message-passing costs, but otherwise
depend primarily on the number of phases required. In the initial implementation of the
primitives, all run in two phases (although the message is delivered during the first one for OBCAST
and the second for BCAST), hence all the primitives give similar performance. The latency figure
measures the time from message transmission to remote delivery. Because the OBCAST
implementation we instrumented is not identical to the one described in this paper, the OBCAST latency
is very high. Moreover, the latency figure turned out to be very hard to measure: using a 60Hz
line-clock, which is the only one available on our SUN workstations, elapsed time can only be
measured to an accuracy of 16ms. Nonetheless, the OBCAST latency (32ms in the B-site case) is
much larger than the inter-site latency (10ms). We found that this results from delay associated
with the I/O operation that occurs when an OBCAST recipient acknowledges delivery to the
initiator. Additional latency is introduced by the small window size, and the inaccurate clock further
inflates the OBCAST figure. We are confident that after we reimplement the primitives using the
algorithms given in [Birman-c], OBCAST latency will not be much higher than the site-site latency
of 10ms.
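
To see how coarse a 60Hz clock is relative to these figures, the following sketch bounds the true elapsed time behind a measured value. It assumes, as an illustration only, that an interval is timed by reading a tick counter at its start and end; the function name tick_bounds is hypothetical.

```python
def tick_bounds(measured_ms, hz=60.0):
    """Return (low, high) bounds in ms on the true elapsed time.

    A difference of n ticks between two counter reads means the true
    interval lies strictly between (n - 1) and (n + 1) tick periods.
    """
    tick = 1000.0 / hz                  # one tick is about 16.7 ms at 60 Hz
    n = round(measured_ms / tick)       # nearest whole number of ticks
    return max(0.0, (n - 1) * tick), (n + 1) * tick
```

Under this assumption a 32ms reading corresponds to two ticks, so the true latency could lie anywhere between roughly 17ms and 50ms, and a 10ms figure is below the resolution of a single tick.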

Turnaround measures the delay from transmission to reception of a reply from the remote
task that received the message, and throughput measures the rate at which a single task can issue
broadcasts without waiting for acknowledgements, in messages per second. The effective
throughput is 3 to 5 times higher than this, because concurrent update permits multiple update
messages to be piggybacked on a single packet (notice that the effective throughput decreases
more slowly than the true throughput as the number of sites increases: while waiting for
acknowledgements, there is time to generate more concurrent update messages, hence the degree of
piggybacking rises). We also measured the system throughput, which is the maximum number of
BCAST or OBCAST protocols that can be started per second at a site in a steady state (this figure
could be improved by tuning the UNIX scheduling policy). Note that the cost of the protocols
rises linearly with the number of destinations, at least when the number of destinations remains
small.

Turning to the resilient objects themselves, we see the dramatic performance impact of the
concurrent update technique when compared with synchronous update. These tests measured the
average cost per operation for a transaction doing 25 operations of the designated type.
Concurrency control overhead is higher for the first operation than for subsequent ones, which the
system recognizes as being “covered” by previously acquired locks. The amortized cost is therefore
low, permitting bursts of 10-12 operations per second even when updating was being done (again,
assuming an otherwise idle system). Concurrent update does better than synchronous
update even in the single-site case because concurrent update is also used to maintain message
routing tables in the type managers. Nesting did not introduce any substantial overhead. Within
the system, most time is spent sending and receiving messages and in the object itself, executing
the requested operation.

Finally, we measured the performance of the file object under a distributed load.
Concurrency control was not included, in order to isolate the effect of replication from other factors.
Two types of tests were undertaken. First, we considered a “mixed” transaction that performed 3
reads before doing a write and committing. The file object was replicated at 1 and 4 sites, and
varying loads of requests were presented randomly at each site. Figure 3a shows the mean
response time for several thousand requests, for loads ranging from 0 to ‘ operations per second.
Each curve stops when the system saturated and began to develop a request backlog. A
component of the file object does two sorts of work when processing these requests: computing related
to the coordinator side of each operation, and work stemming from its role as cohort in requests
initiated remotely. The latter involves processing the initial RPC message, the message containing
the data for the write, and the commit; the former involves generating these messages and
interacting with the external client programs, in addition to executing the operation itself.

Figure 3a, 3b: Responsiveness of a file object as a function of its workload

The data we plotted was obtained by correlating response time for individual requests with
the times at which read and write requests were serviced by the file object. The overall load on
the object was computed by measuring the rate of local reads and writes and adding 4 times the rate
of updates received from remote sites, to account for the 3 reads that were done remotely for
every update sent out. Note that piggybacking makes it possible for a cohort to do quite a bit of
work for each message it receives from the runtime system; in the case of the coordinator this is
generally not the case. It is interesting to observe that except when the load on the object was
very low, response time in the 4-site case is better than that which can be achieved with a
non-replicated object. This effect can be explained by the sharing of coordinator and interface related
activity among the components of the object. Moreover, the maximum capacity of the object to
perform operations rises from 4.25 operations per second in the non-replicated case to 7
operations per second. As the load rises, piggybacking increases the efficiency of the system, explaining
why the response time drops from about .25 seconds to .1 seconds for a typical operation.
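
The load measure used for the horizontal axis can be written out as a one-line computation. The source sentence describing it is partly illegible, so reading the remote term as "4 times the rate of updates received from remote sites" is an assumption; the function and parameter names are hypothetical.

```python
def object_load(local_reads, local_writes, remote_updates, reads_per_update=3):
    """Estimated operations per second seen by one component of the object.

    Each update received from a remote site stands for one write plus the
    reads (3 in the mixed transaction) that were performed remotely.
    """
    return local_reads + local_writes + (reads_per_update + 1) * remote_updates
```

For example, a component servicing 3 local reads and 1 local write per second while receiving 0.5 remote updates per second would be charged with a load of 6 operations per second under this reading.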

We wondered what would happen if transactions did only writes. Figure 3b shows how
response time varies as a function of load for a transaction that does one write and then commits.
In the single-site case the performance of this transaction is close to that for the single-site mixed
case (writes and reads have comparable local costs); in all the replicated cases, however, response
time improves as the object is placed under increasing load (the saturation point is approximately
the same, however). This better response time is explained purely by the reduced
coordinator-related and front-end work being done by the system. Of course, the cost of running the
broadcast protocols rises with the number of sites, and performance would undoubtedly drop again for
objects replicated at very large numbers of sites.

The major conclusion to draw from the above is that when using concurrent update, the
apparent performance of a resilient object accessible from multiple sites can be comparable to or
better than that of a fault-intolerant single-site object of the same type (our experience with the
calendar program supports this). Moreover, overall performance is higher in read-intensive settings,
provided that requests arrive randomly at the different components and the concurrency control
algorithm is good, since reads are done locally. On the negative side, the steadily increasing costs
of the protocols, especially BCAST, suggest that data should not be replicated to more than 3 or 4
sites because concurrency control overhead could become excessive. This has led us to
implement a data migration mechanism for ISIS, which will be described elsewhere [Dietrich]. Our
figures demonstrate that ISIS is able to provide powerful distributed services at surprisingly low cost.
If an effort were made to tune the ISIS prototype and the objects themselves, performance could
probably be doubled under UNIX, and further improved by moving to a more streamlined
operating system.


8. Future research


The ISIS project is now entering its third year. Two major problems are receiving attention:
mechanisms for increasing availability during partitioning, and an investigation of the limits of
concurrency in systems subject to ordering-based correctness constraints [Joseph-a].
Simultaneously, we are examining uses for ISIS in high-level programming tools, which might constitute the
interface to a new generation of operating system services. Also being studied are facilities for
dealing with real-time events, replicated processing (as opposed to replicated data), and
demand-based data migration within k-resilient objects replicated at more than k+1 sites. We would also
like to build some sort of application system using ISIS as its base, for example a critical care
system for medical environments [Birman-d].

Resilient objects are too high-level for many purposes. For example, if all updates to a
given variable originate at a single site, there are cheaper ways to maintain that variable than to
adopt a general purpose transaction mechanism. Recognizing this, we now expect to use ISIS in
an environment that would also permit programmers to work directly with fault-tolerant process
groups. Users could then construct fault-tolerant software using whichever tools seem most
convenient.


A basic problem is that ISIS provides a type of service and exhibits a collection of
requirements which are very different from those seen in most contemporary distributed programs. For
example, UNIX assumes that interactions between processes will be through RPC or virtual
circuits, whereas communication in fault-tolerant distributed systems is strongly biased towards
broadcast protocols. A result is that UNIX is simply not very good at running our software (the
V system [Cheriton] might be more reasonable, although we have not yet considered porting ISIS
to it). Clearly, it is beyond our resources to conduct research into both fault-tolerance and
operating system design. It thus seems appropriate to call for renewed research into primitives and
computational models at the operating system level, and for greater cooperation between the designers
of these two mutually dependent classes of system.

This paper presented an overview of the ISIS project and reviewed the techniques it uses to
obtain fault-tolerant implementations from abstract type specifications. The good performance of
a prototype supports our belief that the approach will be viable in diverse situations. Moreover, a
novel communication architecture leads to a system structure within which correctness arguments
are straightforward despite the presence of failures and concurrency.

We believe that a new generation of high-level computing facilities, including ISIS, is now
emerging. Much as virtual memory changed the engineering of very large systems in a
fundamental way, these facilities will fundamentally change the way that distributed software is developed,
and will thereby enable research in areas for which existing programming methodologies are
inadequate. As the complexity and sheer size of distributed systems continue to grow, facilities
of this sort will be indispensable.